167 research outputs found

    Transforming Time Series for Efficient and Accurate Classification

    Time series data refer to sequences of data that are ordered either temporally, spatially or in another defined order. They can be frequently found in a variety of domains, including financial data analysis, medical and health monitoring and industrial automation applications. Due to their abundance and wide application scenarios, there has been an increasing need for efficient machine learning algorithms to extract information and build knowledge from these data. One of the major tasks in time series mining is time series classification (TSC), which consists of applying a learning algorithm to labeled data to train a model that will then be used to predict the classes of samples from an unlabeled data set. Due to the sequential characteristic of time series data, state-of-the-art classification algorithms (such as SVM and Random Forest) that perform well for generic data are usually not suitable for TSC. In order to improve the performance of TSC tasks, this dissertation proposes different methods to transform time series data for a better feature extraction process as well as novel algorithms to achieve better classification performance in terms of computation efficiency and classification accuracy. In the first part of this dissertation, we conduct a large-scale empirical study that takes advantage of discrete wavelet transform (DWT) for time series dimensionality reduction. We first transform real-valued time series data using different families of DWT. Then we apply dynamic time warping (DTW)-based 1NN classification on 39 datasets and find that existing DWT-based lossy compression approaches can help to overcome the challenges of storage and computation time. Furthermore, we provide assurances to practitioners by empirically showing, with various datasets and with several DWT approaches, that TSC algorithms yield similar accuracy on both compressed (i.e., approximated) and raw time series data.
We also show that, in some datasets, wavelets may actually help in reducing noisy variations which deteriorate the performance of TSC tasks. In a few cases, we note that the residual details/noises from compression are more useful for recognizing data patterns. In the second part, we propose a language model-based approach for TSC named Domain Series Corpus (DSCo), in order to take advantage of mature techniques from both time series mining and Natural Language Processing (NLP) communities. After transforming real-valued time series into texts using Symbolic Aggregate approXimation (SAX), we build per-class language models (unigrams and bigrams) from these symbolized text corpora. To classify unlabeled samples, we compute the fitness of each symbolized sample against all per-class models and choose the class represented by the model with the best fitness score. Through extensive experiments on an open dataset archive, we demonstrate that DSCo performs similarly to approaches working with original uncompressed numeric data. We further propose DSCo-NG to improve the computation efficiency and classification accuracy of DSCo. In contrast to DSCo, where we try to find the best way to recursively segment time series, DSCo-NG breaks time series into smaller segments of the same size; this simplification also leads to simplified language model inference in the training phase and slightly higher classification accuracy. The third part of this dissertation presents a multiscale visibility graph representation for time series as well as feature extraction methods for TSC, so that both global and local features are fully extracted from time series data. Unlike traditional TSC approaches that seek to find global similarities in time series databases (e.g., 1NN-DTW) or methods specializing in locating local patterns/subsequences (e.g., shapelets), we extract solely statistical features from graphs that are generated from time series.
Specifically, we augment time series by means of their multiscale approximations, which are further transformed into a set of visibility graphs. After extracting probability distributions of small motifs, density, assortativity, etc., these features are used for building highly accurate classification models using generic classifiers (e.g., Support Vector Machine and eXtreme Gradient Boosting). Based on extensive experiments on a large number of open datasets and comparison with five state-of-the-art TSC algorithms, our approach is shown to be both accurate and efficient: it is more accurate than Learning Shapelets and at the same time faster than Fast Shapelets. Finally, we list a few industrial applications that are relevant to our research work, including Non-Intrusive Load Monitoring as well as anomaly detection and visualization by means of hierarchical clustering for time series data. In summary, this dissertation explores different possibilities to improve the efficiency and accuracy of TSC algorithms. To that end, we employ a range of techniques including wavelet transforms, symbolic approximations, language models and graph mining algorithms. We experiment and evaluate our approaches using publicly available time series datasets. Comparison with the state-of-the-art shows that the approaches developed in this dissertation perform well, and contribute to advancing the field of TSC.
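
    As a rough illustration of the visibility graph step described in the abstract above, the sketch below builds a natural visibility graph from a short series and computes its density at two scales. It is not the dissertation's implementation: the function names, the block-averaging used as a stand-in for "multiscale approximation", and the choice of density as the only feature are all assumptions made for this example.

```python
def natural_visibility_edges(series):
    """Return the edge set of the natural visibility graph of `series`.

    Samples i < j are connected when every sample strictly between them
    lies strictly below the straight line joining (i, y_i) and (j, y_j).
    """
    n = len(series)
    edges = set()
    for i in range(n - 1):
        for j in range(i + 1, n):
            visible = all(
                series[k]
                < series[i] + (series[j] - series[i]) * (k - i) / (j - i)
                for k in range(i + 1, j)
            )
            if visible:
                edges.add((i, j))
    return edges


def graph_density(n_nodes, edges):
    """Fraction of possible undirected edges that are present."""
    if n_nodes < 2:
        return 0.0
    return 2 * len(edges) / (n_nodes * (n_nodes - 1))


def coarsen(series, factor):
    """Assumed multiscale step: mean of non-overlapping windows."""
    return [
        sum(series[s:s + factor]) / factor
        for s in range(0, len(series) - factor + 1, factor)
    ]


def multiscale_density_features(series, factors=(1, 2)):
    """One density value per scale (hypothetical feature vector)."""
    features = []
    for factor in factors:
        approx = coarsen(series, factor)
        edges = natural_visibility_edges(approx)
        features.append(graph_density(len(approx), edges))
    return features


if __name__ == "__main__":
    print(multiscale_density_features([1, 3, 2, 4]))
```

    A feature vector of this kind could then be passed to a generic classifier; the abstract additionally mentions motif distributions and assortativity, which this sketch leaves out.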

    Slab control on the mega-sized North Pacific ultra-low velocity zone.

    Ultra-low velocity zones (ULVZs) are localized small-scale patches with extreme physical properties at the core-mantle boundary that often gather at the margins of Large Low Velocity Provinces (LLVPs). Recent studies have discovered several mega-sized ULVZs with a lateral dimension of ~900 km. However, the detailed structures and physical properties of these ULVZs and their relationship to LLVP edges are not well constrained and their formation mechanisms are poorly understood. Here, we break the degeneracy between the size and velocity perturbation of a ULVZ using two orthogonal seismic ray paths, and thereby discover a mega-sized ULVZ at the northern edge of the Pacific LLVP. The ULVZ is almost double the size of a previously imaged ULVZ in this region, but with half of the shear velocity reduction. This mega-sized ULVZ has accumulated due to stable mantle flow converging at the LLVP edge driven by slab-debris in the lower mantle. Such flow also develops the subvertical north-tilting edge of the Pacific LLVP.

    Efficient Personalized Federated Learning via Sparse Model-Adaptation

    Federated Learning (FL) aims to train machine learning models for multiple clients without sharing their own private data. Due to the heterogeneity of clients' local data distribution, recent studies explore personalized FL, which learns and deploys distinct local models with the help of auxiliary global models. However, the clients can be heterogeneous in terms of not only local data distribution, but also their computation and communication resources. The capacity and efficiency of personalized models are restricted by the lowest-resource clients, leading to sub-optimal performance and limited practicality of personalized FL. To overcome these challenges, we propose a novel approach named pFedGate for efficient personalized FL by adaptively and efficiently learning sparse local models. With a lightweight trainable gating layer, pFedGate enables clients to reach their full potential in model capacity by generating different sparse models accounting for both the heterogeneous data distributions and resource constraints. Meanwhile, the computation and communication efficiency are both improved thanks to the adaptability between the model sparsity and clients' resources. Further, we theoretically show that the proposed pFedGate has superior complexity with guaranteed convergence and generalization error. Extensive experiments show that pFedGate achieves superior global accuracy, individual accuracy and efficiency simultaneously over state-of-the-art methods. We also demonstrate that pFedGate performs better than competitors in the novel-client and partial-client participation scenarios, and can learn meaningful sparse local models adapted to different data distributions. Comment: Accepted to ICML 202
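
    To make the idea of a gate producing client-specific sparse models more concrete, here is a purely hypothetical sketch. It does not reproduce pFedGate: the sigmoid gate, the per-block scores, the greedy top-scoring selection and every name below are assumptions. It only shows how scores that depend on a client's data, combined with a resource budget, could select a subset of parameter blocks.

```python
import math


def gate_scores(client_features, gate_weights):
    """Hypothetical gating layer: one sigmoid score per parameter block.

    `gate_weights` holds one weight row per block; `client_features`
    stands in for a summary of the client's local data.
    """
    scores = []
    for row in gate_weights:
        z = sum(w * x for w, x in zip(row, client_features))
        scores.append(1.0 / (1.0 + math.exp(-z)))
    return scores


def sparse_block_mask(scores, block_sizes, budget):
    """Greedily keep the highest-scoring blocks that fit the budget.

    `budget` is the fraction of total parameters the client can afford
    (an assumed way to express its resource constraint).
    """
    limit = budget * sum(block_sizes)
    order = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    mask = [0] * len(scores)
    used = 0
    for b in order:
        if used + block_sizes[b] <= limit:
            mask[b] = 1
            used += block_sizes[b]
    return mask


if __name__ == "__main__":
    scores = gate_scores([1.0, -1.0], [[2.0, 0.0], [0.0, 2.0], [0.5, 0.0]])
    # A low-resource client keeps fewer blocks than a high-resource one.
    print(sparse_block_mask(scores, [10, 10, 10], budget=0.34))
    print(sparse_block_mask(scores, [10, 10, 10], budget=0.67))
```

    In this toy setup two clients with the same scores but different budgets end up with different masks, which mirrors the abstract's point that sparsity adapts to each client's resources.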

    Understanding Android App Piggybacking: A Systematic Study of Malicious Code Grafting

    The Android packaging model offers ample opportunities for malware writers to piggyback malicious code in popular apps, which can then be easily spread to a large user base. Although recent research has produced approaches and tools to identify piggybacked apps, the literature lacks a comprehensive investigation into this phenomenon. We fill this gap by 1) systematically building a large set of piggybacked and benign app pairs, which we release to the community, 2) empirically studying the characteristics of malicious piggybacked apps in comparison with their benign counterparts, and 3) providing insights on piggybacking processes. Among several findings that analysis techniques should build upon to improve the overall detection and classification accuracy of piggybacked apps, we show that piggybacking operations not only concern app code but also extensively manipulate app resource files, largely contradicting common beliefs. We also find that piggybacking is done with little sophistication, in many cases automatically, and often via library code.

    Revisiting Personalized Federated Learning: Robustness Against Backdoor Attacks

    In this work, besides improving prediction accuracy, we study whether personalization could bring robustness benefits against backdoor attacks. We conduct the first study of backdoor attacks in the pFL framework, testing 4 widely used backdoor attacks against 6 pFL methods on benchmark datasets FEMNIST and CIFAR-10, a total of 600 experiments. The study shows that pFL methods with partial model-sharing can significantly boost robustness against backdoor attacks. In contrast, pFL methods with full model-sharing do not show robustness. To analyze the reasons for varying robustness performances, we provide comprehensive ablation studies on different pFL methods. Based on our findings, we further propose a lightweight defense method, Simple-Tuning, which empirically improves defense performance against backdoor attacks. We believe that our work could both provide guidance for pFL application in terms of its robustness and offer valuable insights to design more robust FL methods in the future. We open-source our code to establish the first benchmark for black-box backdoor attacks in pFL: https://github.com/alibaba/FederatedScope/tree/backdoor-bench. Comment: KDD 202

    The Research of Population Genetic Differentiation for Marine Fishes (Hyporthodus septemfasciatus) Based on Fluorescent AFLP Markers

    Hyporthodus septemfasciatus is a commercially important proliferation fish which is distributed in the coastal waters of Japan, Korea, and China. We used the fluorescent AFLP technique to assess the genetic differentiation between broodstock and offspring populations. A total of 422 polymorphic bands (70.10%) were detected from the 602 amplified bands. A total of 308 polymorphic loci were detected for broodstock I (Pbroodstock I = 55.50%), compared with 356 for broodstock II (Pbroodstock II = 63.12%) and 294 for offspring (Poffspring = 52.88%). The level of genetic diversity was higher in the broodstock populations than in the offspring. Both AMOVA and Fst analyses showed that significant genetic differentiation existed among populations, and limited fishery recruitment to the offspring was detected. STRUCTURE and PCoA analyses indicated that two management units existed and most offspring individuals (95.0%) only originated from 44.0% of the individuals of broodstock I, which may have negative effects on sustainable fry production.